Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Filter OpenShift VMs by features #770

Merged
merged 1 commit into from
Nov 30, 2023
Merged

Conversation

rszwajko
Copy link
Contributor

@rszwajko rszwajko commented Oct 31, 2023

Supported features:

  1. NUMA
  2. GPUs/Host Devices
  3. Persistent TPM/EFI
  4. Dedicated CPU

Screenshot from 2023-11-30 12-12-25
Screenshot from 2023-11-30 12-12-49

@yaacov
Copy link
Member

yaacov commented Nov 5, 2023

@rszwajko looks nice, but it looks too detailed to me, a user need to do lots of clicks and decisions to select all vms with some feature.

@ahadas can we group some features that have things in common together (efi+tpm, cpu+gpu ... )?
should we do "vm has special features" selection option, and remove all granularity ?

@ahadas
Copy link
Member

ahadas commented Nov 6, 2023

@ahadas can we group some features that have things in common together (efi+tpm, cpu+gpu ... )? should we do "vm has special features" selection option, and remove all granularity ?

yeah, grouping features sounds like a good idea
the question is how - let's say that we have 'efi+tpm' group, does it mean we'll filter VMs that have either nvram or tpm?

thinking out aloud:
as I see it, this view is useful to assessing the ability to migrate the virtual machine before creating plans - get to this view and see which VMs have warnings / errors, that's more important than showing the features themselves since when we think about the features we can divide them to two groups:

  1. features that will be 'lost' during the migration, e.g., host devices. I don't see much value in filtering them here since users are unlikely to go back, e.g., to RHV and detach the host devices when they can just select those VMs in MTV and continue with their migration, knowing the host devices will not be attached to the migrated VMs.
  2. features that block the migration. there aren't many features like that, and we need to think about those that exist today (e.g., sr-iov is blocking migrations from vsphere but it doesn't block migrations from rhv.. same case for host/passthrough devices). it can be useful to filter by such features so users could make changes in the source environment in order to be able to migrate them, but the question is whether it makes sense at all to require users to do it or treat all features like those in the first group that are dropped if they're not supported by kubevirt (with a warning)

so maybe that's actually a question to ask - what value should filtering by features bring to users?

@yaacov
Copy link
Member

yaacov commented Nov 6, 2023

@ahadas thanks,

@rszwajko can we do finer filtering using the warnings / errors, IARC we only filter by existing of warning or error, we don't differentiate between specific errors or warnings, @ahadas does it makes sense to allow filtering by different error / warning types ?

@ahadas doe the scenario of making different plans for different VM gropups makes sense ?
"as an admin I want to create a different plan to migrate all the vms that require GPU and a different migration plan for the VMs that require special vnics"

@rszwajko
Copy link
Contributor Author

rszwajko commented Nov 6, 2023

@ahadas

this view is useful to assessing the ability to migrate the virtual machine before creating plans

The future plan is to replace Plan Wizard with this screen i.e. you filter the VMs and click "Migrate" button.

features that will be 'lost' during the migration, e.g., host devices.

For OpenShift providers we have the same capabilities on source and on the target so no feature need to be lost.
The target needs only to have the correct configuration in place.
For other providers this is more complex - that's why it's not supported yet (this PR is only about OpenShift)

what value should filtering by features bring to users?

The primary use case is a successful migration of VMs with features that are not covered by our mappings. Such features require manual config on the target(i.e. in a dedicated namespace). Most likely other VMs will need a different setup so it's better to keep both groups separated.

The flow could look the following:

  1. filter VMs that use some special feature (i.e. GPU)
  2. select the VMs and start the migration wizard
  3. pick a namespace on the target that have the given special feature pre-configured
  4. customize other migration option
  5. trigger the migration

@rszwajko
Copy link
Contributor Author

rszwajko commented Nov 6, 2023

@yaacov

can we do finer filtering using the warnings / errors, IARC we only filter by existing of warning or error, we don't differentiate between specific errors or warnings,

It's possible to do free-text filtering based on labels. The structure looks the following:

{
      category: 'Warning',
      label: 'Shareable disk detected',
      assessment:
        'Shared disks are only supported by certain OpenShift Virtualization storage configurations. Ensure that the correct storage is selected for the disk.',
    },

@ahadas
Copy link
Member

ahadas commented Nov 6, 2023

@rszwajko can we do finer filtering using the warnings / errors, IARC we only filter by existing of warning or error, we don't differentiate between specific errors or warnings, @ahadas does it makes sense to allow filtering by different error / warning types ?

yeah, that makes sense to me
if you do this by the label of a validation, like this, then we'll avoid making changes to the labels (we already rephrased/changed few assessments)

@ahadas doe the scenario of making different plans for different VM gropups makes sense ? "as an admin I want to create a different plan to migrate all the vms that require GPU and a different migration plan for the VMs that require special vnics"

it certainly makes sense to make different plans for different VM groups since we need a plan per target namespace, however, I don't think it would be based on the VM features/capabilities but rather based on their ownership/permissions

@ahadas
Copy link
Member

ahadas commented Nov 6, 2023

@ahadas

this view is useful to assessing the ability to migrate the virtual machine before creating plans

The future plan is to replace Plan Wizard with this screen i.e. you filter the VMs and click "Migrate" button.

ah ok, I didn't realize that's the context, thanks

features that will be 'lost' during the migration, e.g., host devices.

For OpenShift providers we have the same capabilities on source and on the target so no feature need to be lost. The target needs only to have the correct configuration in place. For other providers this is more complex - that's why it's not supported yet (this PR is only about OpenShift)

currently you're right as we assume we migrate between OCP clusters that have the same OCP version
I thought you wrote "OpenShift VMs" by mistake and you meant RHV/vSphere/OpenStack VMs, ok

what value should filtering by features bring to users?

The primary use case is a successful migration of VMs with features that are not covered by our mappings. Such features require manual config on the target(i.e. in a dedicated namespace). Most likely other VMs will need a different setup so it's better to keep both groups separated.

The flow could look the following:

  1. filter VMs that use some special feature (i.e. GPU)
  2. select the VMs and start the migration wizard
  3. pick a namespace on the target that have the given special feature pre-configured
  4. customize other migration option
  5. trigger the migration

I see, but then looking at the features above:

  1. Dedicated CPU - not namespace-scoped
  2. GPUs - I'm not sure if that's namespace-scoped
  3. Host devices - unsupported in kubevirt
  4. NUMA - not namespace-scoped
  5. Persistent EFI - not namespace-scoped
  6. Persistent TPM - not namespace-scoped

there is SR-IOV that is namespace-scoped but that's supposed to be covered by the network mapping (i.e., if we get a VM from the source that is mapped to a network-attachment-definition with SR-IOV, then we need to make sure that network is mapped to a network-attachment-definition with SR-IOV on the target)

@rszwajko
Copy link
Contributor Author

rszwajko commented Nov 6, 2023

@ahadas
I've browsed the official Kubevirt user guide and picked features that looked promising. The list requires more attention for sure.

As for your comments - note that most of those features require configuration at node level. Nodes in turn can be assigned to namespaces (see below). However even if we assume that this is a cluster wide setting it only means that decision is made at different level i.e.

  1. select the VMs
  2. pick a target cluster with given capability
  3. trigger the migration

Dedicated CPU - not namespace-scoped

The feature requires nodes with a running CPU manager.

GPUs - I'm not sure if that's namespace-scoped

Host devices - unsupported in kubevirt

See https://kubevirt.io/user-guide/virtual_machines/host-devices/
Note the user guide lists NVMe passthrough as one of the use cases and recommends nodeSelector to mitigate issues with assignment.

NUMA - not namespace-scoped

Requires dedicated CPU resources and hugepages which are node-level.

Persistent EFI - not namespace-scoped
Persistent TPM - not namespace-scoped

I've over-interpreted the docs. The setting seems per cluster.

@yaacov
Copy link
Member

yaacov commented Nov 7, 2023

Note:
when filtering VMs it is not just for namespace, it can also be for other reasons, for example migrating some VMs in phase 1 and doing other VMs in phase 2 and 3, IMHO filters that can help users to select different VMs for different migration phase based on resource can be helpful

@ahadas
Copy link
Member

ahadas commented Nov 7, 2023

@rszwajko ack

... Nodes in turn can be assigned to namespaces (see below). However even if we assume that this is a cluster wide setting it only means that decision is made at different level i.e.

  1. select the VMs
  2. pick a target cluster with given capability
  3. trigger the migration

can you point to me to where we see nodes can be assigned to namespaces? I'm not familiar with that configuration

about the steps above, in our case it's not like oVirt that we have a data center that contains multiple clusters and then we pick VMs in the first step and follow with picking a target cluster, but rather we pick the target cluster (cluster == OCP instance) and then pick the VMs to migrate to a namespace in that cluster

I doubt users would pick different OCP instances based on features, even though it's possible, but I think it's more likely that people would want to test things, e.g., test migration of a VM with vGPU, before initiating the full migration as @yaacov wrote above, so ok, I see the value in this now - it is worth mentioning it somewhere though

Dedicated CPU - not namespace-scoped

The feature requires nodes with a running CPU manager.

right, and CPU manager is enabled per-cluster

GPUs - I'm not sure if that's namespace-scoped
Host devices - unsupported in kubevirt

See https://kubevirt.io/user-guide/virtual_machines/host-devices/ Note the user guide lists NVMe passthrough as one of the use cases and recommends nodeSelector to mitigate issues with assignment.

ack, it didn't exist back then so we also need to adjust our validations - we currently still warn users that host devices won't be mapped

NUMA - not namespace-scoped

Requires dedicated CPU resources and hugepages which are node-level.

ok, but the problem with node-level is that we don't know, in MTV, the node-level specification in kubevirt

Persistent EFI - not namespace-scoped
Persistent TPM - not namespace-scoped

I've over-interpreted the docs. The setting seems per cluster.

here we have a bigger issue - we cannot get the persistent data from RHV and that from vSphere won't help us even if we could have got it, so regardless of the support for these features in kubevirt, we should warn about VMs with these features.

we thought at some point, and I proposed it in kubevirt user's mailing list, to try to identify which features are supported or enabled in kubevirt - this could allow us to produce different message, e.g., when TPM is not supported in kubvirt vs. when it's supported, but it didn't get much attention from the kubevirt side so it means we'll have to store, e.g., TPM is supported as from version X, however this is not that easy and we'll need to have that mapping also for k8s..

@rszwajko
Copy link
Contributor Author

rszwajko commented Nov 7, 2023

@ahadas

can you point to me to where we see nodes can be assigned to namespaces? I'm not familiar with that configuration

check out PodNodeSelector

Dedicated CPU - not namespace-scoped
The feature requires nodes with a running CPU manager.
right, and CPU manager is enabled per-cluster

My understanding is that a cluster can have multiple nodes and not all nodes may have a CPU manager. Generally running the CPU manager and labeling seems a manual process.

NUMA - not namespace-scoped
Requires dedicated CPU resources and hugepages which are node-level.
ok, but the problem with node-level is that we don't know, in MTV, the node-level specification in kubevirt

Yes and it's unlikely that we will cover such advanced features with mappings. Preparing such configuration will rather stay a manual process. We can however try to support the users in the process i.e. with filtering the right VMs and choosing the pre-configured target provider/namespace.

@ahadas
Copy link
Member

ahadas commented Nov 7, 2023

@ahadas

can you point to me to where we see nodes can be assigned to namespaces? I'm not familiar with that configuration

check out PodNodeSelector

ah ok, so they're not really "assigned" but rather "selected", like we can affect pod scheduling with taints and tolerations but here it's done on at the namespace-level, nice

Dedicated CPU - not namespace-scoped
The feature requires nodes with a running CPU manager.
right, and CPU manager is enabled per-cluster

My understanding is that a cluster can have multiple nodes and not all nodes may have a CPU manager. Generally running the CPU manager and labeling seems a manual process.

yes, the CPU manager runs on kubelet, i.e., on the nodes
I meant that the feature can be enabled at the cluster-level in kubevirt

NUMA - not namespace-scoped
Requires dedicated CPU resources and hugepages which are node-level.
ok, but the problem with node-level is that we don't know, in MTV, the node-level specification in kubevirt

Yes and it's unlikely that we will cover such advanced features with mappings. Preparing such configuration will rather stay a manual process. We can however try to support the users in the process i.e. with filtering the right VMs and choosing the pre-configured target provider/namespace.

we are definitely not going to prepare such configuration. btw, we also don't prepare network configuration but count on users to define the network attachment definitions and only map to them, so that makes sense

about choosing pre-configured target provider/namespace automatically - that's very complicated.
even taking a simpler case: an OCP instance, that has OpenShift Virtualization deployed on it, and have a namespace that selects hosts that non of them have virtualization enabled - we (in MTV) can't tell that this namespace cannot be used for VMs, let alone more complicated scenarios of nodes with host devices..

also take into account that in k8s it all works with "desired state" - you may want to migrate the vm to a new namespace or an existing namespace that don't yet have the resources the VMs need, but the VMs should still ask for them as you're going to add it later on

I think it comes down to: in which situation we need a feature without a concern because if we have concerns (warnings / errors) for each feature, we better let users filter concerns rather than features

@yaacov
Copy link
Member

yaacov commented Nov 7, 2023

I think it comes down to: in which situation we need a feature without a concern because if we have concerns (warnings / errors) for each feature, we better let users filter concerns rather than features

examples:
a. migrating in batches by features, first batch are small vms with one distk, and in the next batch vms that require GPU, and lastly if all is ok migrate vms with special nics
b. migrating only a subset of the vms
c. creating a re-migration of vms that failed the last migration because the had some issue with efi so we need to re-select only this ones

@ahadas
Copy link
Member

ahadas commented Nov 7, 2023

I think it comes down to: in which situation we need a feature without a concern because if we have concerns (warnings / errors) for each feature, we better let users filter concerns rather than features

ok, let's check this concretely for the features above:
dedicated CPUs: not all the features in ovirt exist in kubevirt (see those validations), so for those that don't exist we need a concern and for those that exist we can have a feature (?)
GPUs: we currently don't handle vGPUs in forklift so it would be better to have a concern that warns about this
Host devices: forklift doesn't handle that as well (I just saw that USB passthrough was added in kubevirt 1.1.0 that was released today)
NUMA: same as GPUs and host devices
and for persistent EFI/TPM we need concerns

@ahadas
Copy link
Member

ahadas commented Nov 7, 2023

I think it comes down to: in which situation we need a feature without a concern because if we have concerns (warnings / errors) for each feature, we better let users filter concerns rather than features

examples: a. migrating in batches by features, first batch are small vms with one distk, and in the next batch vms that require GPU, and lastly if all is ok migrate vms with special nics b. migrating only a subset of the vms c. creating a re-migration of vms that failed the last migration because the had some issue with efi so we need to re-select only this ones

@yaacov ok, that's something different - here you say when filtering of either concerns or features may be useful

for the first case, do we agree that it's more likely that such filtering would happen with features that we have concerns for? the way you put it is too broad since for that we can add features of, e.g., 'two disks', 'nic with special properties', etc

also think about our internal migration from RHV to OpenShift - this would mean to divide the plans that we created for each target namespace to multiple plans that are per-target namespace and per-feature. we have in the backlog a task about prioritization of the migrations - not to create plans with less VMs but to say "here is my plan, I want to start with certain VMs, then other group of VMs, and so on"

for the third case, if users need to create new plans for failed migrations, we did something incorrect. user are supposed to be able to restart a plan and then only the failed migrations would be retriggered

@yaacov
Copy link
Member

yaacov commented Nov 7, 2023

for the first case, do we agree that it's more likely that such filtering would happen with features that we have concerns for? the way you put it is too broad since for that we can add features of, e.g., 'two disks', 'nic with special properties', etc

hmm, from my side this is a discussion about how to choose vms for migration plan, not about this specific feature list.
the question I want answer to, is how to best choose vms for migration,
IMHO using only warnings, e.g. "migrate all vms without warnnigs" is less helpful then "migrate all vms that match this criteria"
for me, this discussion is about what are the best criteria , i less care about the list in this specific PR

this discussion is about how we can make the plan creation more useful,
the question we ask is not if this specific list of features is useful, the question is what list of features will be useful,
should we filter by number if disks ?
should we filter by labels ? what provider support something that resemble labels ?
also note that this feature list is provider specific, this discussion is not specific to overt or openshift, the question is how do we best help users to choose vms when creating a migration plan when they need to choose 100s of vms out of the 1000s a provider include

@yaacov
Copy link
Member

yaacov commented Nov 7, 2023

@ahadas , I think I understand the problem ... this is a PR that is a sub task of an issue:
#764

@rszwajko @ahadas let's continue in the issue as it is a better place for discussions

Supported features:
1. NUMA
2. GPUs/Host Devices
3. Persistent TPM/EFI
4. Dedicated CPU

Signed-off-by: Radoslaw Szwajkowski <[email protected]>
Copy link

Kudos, SonarCloud Quality Gate passed!    Quality Gate passed

Bug A 0 Bugs
Vulnerability A 0 Vulnerabilities
Security Hotspot A 0 Security Hotspots
Code Smell A 0 Code Smells

No Coverage information No Coverage information
0.3% 0.3% Duplication

@rszwajko
Copy link
Contributor Author

Change summary in the force push above:

  1. merge similar features to limit the number of options in the filter
  2. in the feature column display labels
  3. rebase on top of the current main branch

@rszwajko rszwajko marked this pull request as ready for review November 30, 2023 11:23
@rszwajko
Copy link
Contributor Author

@yaacov please re-review.
note that the screenshots (far) above have been updated

@@ -64,3 +64,5 @@ autoattach
virt
multiqueue
filesystems
bootloader
typeahead
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NICE !

@yaacov yaacov merged commit ab39a45 into kubev2v:main Nov 30, 2023
5 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants